AirBnB Data Analysis

After all the set up for database in SQL was completed, let's we try to fetch the data using SQL query in python and analyze them to perform useful business insight.

Here below the several library that we will use to analyzing AirBnB Data :

In [1]:
# Data wrangling
import pandas as pd
import numpy as np
import re

# Connect MySQL
import mysql.connector

# Data Visualization
import plotly.express as px

# Draw map
import folium
from folium.plugins import MarkerCluster

%matplotlib inline

Create connection into the database

Frist of all we need to create a connection in our database.

In [2]:
# conn =  mysql.connector.connect( 
#         host = "db4free.net",
#         port = 3306,
#         user = "user",
#         password = "password!",
#         database = "database-name"
#         ) 
In [3]:
conn =  mysql.connector.connect( 
        host = "localhost",
        port = 3306,
        user = "root",
        password = "",
        database = "airbnb_data"
        ) 

Let's check the available table in our database.

In [4]:
pd.read_sql_query("SHOW TABLES", conn)
Out[4]:
Tables_in_airbnb_data
0 host_info
1 listing
2 review
3 reviewer

Number of observation from listing, host, and review table

In [5]:
# number of listings
number_listing = pd.read_sql_query("SELECT COUNT(*) AS freq FROM listing", conn)
In [6]:
number_listing['freq'][0]
Out[6]:
4388
In [7]:
# number of host
number_host = pd.read_sql_query("SELECT COUNT(*) AS freq FROM host_info", conn)
In [8]:
number_host['freq'][0]
Out[8]:
1205
In [9]:
# number of review
number_review = pd.read_sql_query("SELECT SUM(number_of_reviews) as num_review FROM listing", conn)
In [10]:
number_review
Out[10]:
num_review
0 53675.0

Research Question

For the first business insight, we will take a look deep down into variables price and property to get useful information about the corelation between them and extract the hidden information. Here's several business question we want to answer :

Room Listing, Price and Property

  • How is the price distribution based on the geographic location (latitude and longitude)?
  • What is the most common property type?
  • What is the top and bottom 10 property type by average price
  • What is the most common room type available?
  • What is the most common amenities provided?
  • Is there any correlation between amenities provided with the price?
  • Is there any correlation between room price and the review score?

Host

  • Who are the top 10 host based on revenue?
  • Is there any difference in review score between superhost and normal host?
  • Is there any difference in response rate between superhost and normal host?
  • What is the most commonly verified host information?

Price

How is the price distribution based on the geographic location (latitude and longitude)?

The first business question is to get information about price distribution based on geograpichal location. Information will perform in map visualization, so we will using folium package from python to visualize longitude and latitude inside the data and creating popup to display the detail information in each point.

In [73]:
query = '''
    SELECT name, SUM(price) AS price, longitude, latitude
    FROM listing
    GROUP BY name, longitude, latitude
    '''
df_query = pd.read_sql_query(query,conn)
In [74]:
df_query.head()
Out[74]:
name price longitude latitude
0 ! BEST ! LOCATION private room at Central Orchard 82.0 103.83704 1.30532
1 !! CozyRoom@City Center,Little India,FarrerPar... 58.0 103.85331 1.31516
2 !Stay in City Center, next to Farrer Park MRT (1) 61.0 103.85444 1.31609
3 ##New Cozy Room @ Farrer Park 50.0 103.85671 1.31160
4 #2 Private Cozy Rooms @ Farrer Park 103.0 103.85831 1.31144
In [75]:
df_query.isna().sum()
Out[75]:
name         0
price        0
longitude    0
latitude     0
dtype: int64
In [76]:
map_sg = folium.Map(location=[df_query.latitude.mean(), df_query.longitude.mean()], zoom_start=5)
In [77]:
marker_cluster = folium.plugins.MarkerCluster().add_to(map_sg)
In [78]:
for name, lat, lon, price in \
zip(df_query['name'],df_query['latitude'], df_query['longitude'], df_query['price']):
    popup = folium.Html("<b>"  + name + "</b><br>" +\
                        "Price: {:,}".format(price) + "<br>", script = True)
    popup = folium.Popup(popup, max_width=2650)
    folium.Marker(location = [lat, lon],
                  popup = popup
                 ).add_to(marker_cluster)
In [17]:
map_sg
Out[17]:
Make this Notebook Trusted to load map: File -> Trust Notebook

What is the most common property type?

Second business question is what is the most common property type inside the data?

We can use COUNT and GROUP BY syntax from SQL query to calculate how many frequency does the each property appear on the data. After get the calculation we can order them using ORDER BY syntax and select only top 10 information using LIMIT query.

In [87]:
query =  '''
    SELECT property_type, COUNT(property_type) AS frequency
    FROM listing
    GROUP BY property_type
    ORDER BY frequency DESC
    LIMIT 10
    '''
df_query = pd.read_sql_query(query,conn)
In [88]:
df_query.head()
Out[88]:
property_type frequency
0 Private room in apartment 953
1 Entire condominium 845
2 Entire apartment 545
3 Private room in condominium 375
4 Entire serviced apartment 334
In [89]:
fig = px.bar(df_query, x='frequency', y='property_type', 
             title="Most Common Property Type in Listings",
             orientation='h',
            labels={
                'property_type' : 'Property Type',
                'frequency' : 'Frequency'
            })

fig.show()

Visualization above show that private room in apartment is the most common used in AirBnB hotel, whilst the least used is private room in twonhouse.

What is the top and bottom 10 property type by average price?

The third question is what is the top and bottom 10 property type based on their average price.

When we only calculate the average price without considering the number of appearances in each property type, the result will lead to irrelevant information. Why? Because there are several property type that have deluxe and high-priced, but only few or even one hotel who have those facility. So, we need to filter the property type which have more than 20 frequency from data.

In [21]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go
In [90]:
query =  '''
    SELECT property_type, average_price, frequency
    FROM 
        ( SELECT property_type, AVG(price) AS average_price, COUNT(*) as frequency
          FROM listing
          GROUP BY property_type
          ) as new_data
    WHERE frequency >=20
    ORDER BY average_price DESC
    LIMIT 10
    '''
df_query = pd.read_sql_query(query,conn)
In [91]:
df_query = df_query.sort_values(by='average_price', ascending=True)
In [92]:
df_query
Out[92]:
property_type average_price frequency
9 Room in boutique hotel 145.564103 195
8 Shared room in apartment 174.818182 44
7 Room in serviced apartment 182.513514 37
6 Private room 201.307692 26
5 Entire condominium 201.968047 845
4 Room in hotel 208.206107 131
3 Entire apartment 232.601835 545
2 Private room in hostel 240.685714 35
1 Entire house 249.291667 72
0 Entire serviced apartment 258.494012 334
In [93]:
plot1 = go.Bar(x=df_query['average_price'], 
               y=df_query['property_type'],
               name='Top 10',
               orientation='h',
               marker=dict(
                   color = 'blue'
               )
              )
In [94]:
query =  '''
    SELECT property_type, average_price, frequency
    FROM 
        ( SELECT property_type, AVG(price) AS average_price, COUNT(*) as frequency
          FROM listing
          GROUP BY property_type
          ) as new_data
    WHERE frequency >=20
    ORDER BY average_price
    LIMIT 10
    '''
df_query = pd.read_sql_query(query,conn)
In [95]:
df_query.sort_values(by='average_price', inplace=True)
In [96]:
plot2 = go.Bar(x=df_query['average_price'], y=df_query['property_type'],
                orientation='h',
               name='Bottom 10',
               marker=dict(
                   color = 'red'
               )
              )
In [97]:
sub = make_subplots(rows=2, cols=1)
sub.append_trace(plot1, 1,1)
sub.append_trace(plot2, 2, 1)
In [98]:
sub.update_layout(height=600, width=800, title_text="Top and Bottom 10 Property based on Average Price",xaxis2=dict(range=[0,270]))

Entire service apartment become the top listing with highest average price, following with Entire house and Room in boutique hotel in top 10 listing with the highest average price. Whilst Shared room in bed and breakfast have lowest average price.

What is the top room type by average price?

Above information shown about top and bottom property type based on average price, but how about the room type?

Using the same technique and analysis, let's perform the hidden information between room type and average price!

In [99]:
query = '''
    SELECT room_type, average_price, frequency
    FROM 
        ( SELECT room_type, AVG(price) AS average_price, COUNT(*) as frequency
          FROM listing
          GROUP BY room_type
          ) as new_data
    WHERE frequency >=20
    ORDER BY average_price DESC
    '''
df_query = pd.read_sql_query(query,conn)
In [100]:
fig = go.Figure(data=[go.Pie(labels=df_query['room_type'], values=df_query['average_price'], hole=.4)])
fig.update_traces(hole=.4, marker=dict(colors=['#e57373', '#f06292', '#4db6ac','#81c784']))
fig.update_layout(
    title_text="Room Type by Average Price")
fig.show()

Unlike property type columns which have many kind of type property, room type have only 4 unique value inside, they are : Hotel room, entire home/apt, private room, and shared room. And the top provided room type in AirBnB data is Entire home/apt, while the less provided room type is Shared room

Is there any correlation between amenities provided with the price?

In the last business question about the price, let's gather information wether amenities has correlation with price.

In [101]:
query = '''
    SELECT amenities, price
    FROM listing
    '''
df_query = pd.read_sql_query(query, conn)

Most amenities

First of all we need to check and create a dataframe to perform information about detail amenities and their frequency.

In [102]:
most_amenities = df_query['amenities'].str.split(', ', expand=True).stack().value_counts().to_frame("Total").head(10)
In [103]:
most_amenities.sort_values(by='Total',ascending=True)
Out[103]:
Total
TV 3154
Dedicated workspace 3297
Iron 3374
Kitchen 3425
Hangers 3582
Washer 3752
Essentials 3797
Wifi 4231
Air conditioning 4254
Long term stays allowed 4283
In [104]:
fig = px.bar(most_amenities, x='Total', y=most_amenities.index, 
             title="Top Amenities",
            orientation='h',
            labels={
                'index' : 'Amenities',
                'Total' : 'Frequency'
            })

fig.show()

Almost all of the available listing has Air conditioning and Long term stays allowed, followed by Essentials, Wifi, and Washer. More than 60% of all listing also has Dedicated workspace and Kitchen, which may require more spaces outside the bedroom.

Let's check whether amenities have correlation with price. Commonly where listing have complete ameneties the price will increase.

In [105]:
df_query['count_amenities'] = df_query.amenities.str.strip().str.split(',').apply(len)
In [106]:
df_query.head()
Out[106]:
amenities price count_amenities
0 Wifi, Elevator, Long term stays allowed, Air c... 81.0 7
1 Wifi, Kitchen, Elevator, Long term stays allow... 80.0 12
2 Wifi, Kitchen, Elevator, Air conditioning, TV,... 67.0 8
3 Hair dryer, Kitchen, Free street parking, Keyp... 177.0 25
4 Coffee maker, Hair dryer, Kitchen, Free street... 81.0 21
In [107]:
px.scatter(df_query,
           x = np.log10(df_query['price']),
           y = df_query['count_amenities'],
           title='Correlation Between Amenities and Price',
           opacity=0.5,
           labels={
                'x' : 'Price',
                'count_amenities' : 'Count Amenities'
            })
C:\Users\dyahn\anaconda3\lib\site-packages\pandas\core\series.py:679: RuntimeWarning:

divide by zero encountered in log10

Correlation Between Price and Overall Experience

Let's check whether listing with higher price will give higher overall experience as well using simple scatterplot. We will scale the listing price to log10 units. We only collect listing with non-NULL value for the review scores rating. We will also filter the data by collecting listing that has more than 10 number of reviews.

In [108]:
# Collect Required Columns
query = """
SELECT price, review_scores_rating
FROM listing
WHERE review_scores_rating IS NOT NULL AND number_of_reviews > 10
"""

df_query = pd.read_sql_query(query, conn)
In [109]:
px.scatter(df_query,
           x=np.log10(df_query.price),y='review_scores_rating',
           title='Correlation Between Price and Overall Experience',
           labels={
               'x' : 'Price',
               'review_scores_rating' : 'Scores Rating'
           },
           opacity=0.5
          )

Host

We will continue answering the research question by looking at the host data.

Top Host by Earning

We will see who is the top host based on the total earning cumulated from his/her listings. We will use the following formula to calculate the total earning:

$$ Total\ earning = \Sigma_{i=1}^n price_i\times number\ of\ reviews_i \times minimum\ nights_i $$

Notes:

  • n: Number of listing for the host
  • price: Price of ith listing from the host
  • number of reviews: Number of reviews of ith listing
  • minimum nights: Minimum night of stay of ith listing

Unfortunately, we don't have the detailed data on the number of stay for each customer, so we will use the minimum nights as the number of stay so what we actually calculate is the minimum total earning. We use the number of reviews as the proxy of number of customers.

In [42]:
query = """
SELECT host_id, host_name, SUM(total_earning) as total_earning, COUNT(*) as number_of_listing, AVG(price) as average_price
FROM
    (SELECT listing.host_id, host_info.host_name, price, number_of_reviews, minimum_nights, 
            price * number_of_reviews * minimum_nights as total_earning
    FROM listing
    LEFT JOIN host_info
    ON listing.host_id = host_info.host_id
    WHERE host_name IS NOT NULL
    ) as new_table
GROUP BY host_id, host_name
ORDER BY SUM(total_earning) DESC
"""

df_query = pd.read_sql(query, conn)
df_query.head()
Out[42]:
host_id host_name total_earning number_of_listing average_price
0 23722617 Alex 8829936.0 57 86.736842
1 14521708 Shirley 7861590.0 25 83.840000
2 178584113 Oakwood Premier AMTD 6797030.0 6 447.166667
3 25062093 Natasha K 6769000.0 2 94.000000
4 7511976 Paul 5235000.0 2 2250.000000
In [43]:
# Get top 15 host by number of listing
df_query.sort_values("number_of_listing", ascending= False, inplace = True)
df_viz = df_query.head(15).copy()

df_viz.sort_values("number_of_listing", inplace = True)
In [44]:
fig = px.bar(df_viz, x='number_of_listing', y='host_name', 
             title="Top Host by Number of Listing",
            orientation='h',
            labels={
                'number_of_listing' : 'Number of Listing',
                'host_name' : 'Host Name'
            })
fig.show()

Only a handful of hosts has more than 5 listings. Now we will continue looking at the top 15 based on the Total Earning generated.

In [45]:
# Get top 15 host by total_earning
df_query.sort_values("total_earning", ascending= False, inplace = True)
df_viz = df_query.head(15).copy()

df_viz.sort_values("total_earning", inplace = True)
In [46]:
fig = px.bar(df_viz, x='total_earning', y='host_name', 
             title="Top Host by Total Earning",
            orientation='h',
            labels={
                'total_earning' : 'Total Earning',
                'host_name' : 'Host Name'
            })
fig.show()

So, the list of host name is different from the top host by the number of listing and the top host by total earning. This indicates that a higher number of listings doesn't guarantee to give more earnings.

Does higher average price of listing from a single host correlate with higher total earning? Let's answer this questions using scatterplot.

In [47]:
fig = px.scatter(df_query, x = 'average_price', y = 'total_earning',
                title="Average Price vs Total Earning",
                opacity=0.5,
                labels={
                    'average_price' : 'Average Price',
                    'total_earning' : 'Total Earning'
                })

fig.show()

As we can see, there is no visible pattern between average price and total earning. However, some of the most highest total earnings are generated by host with relatively low average price from his/her listings. Therefore, higher average price from a host doesn't guarantee to give him/her a higher total earning

Superhost

According to Airbnb, superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests.

In here we are interested to check if there is any difference in the distribution of review scores for listing owned by a superhost and normal host.

In [48]:
query = """
SELECT review_scores_rating, listing.host_id, host_info.host_name, host_info.host_is_superhost
FROM listing
LEFT JOIN host_info
ON listing.host_id = host_info.host_id
WHERE review_scores_rating IS NOT NULL and host_is_superhost IS NOT NULL
"""

df_query = pd.read_sql(query, conn)

# Transform superhost as Boolean/Logical
df_query.host_is_superhost = df_query.host_is_superhost.astype('bool')
In [49]:
df_query
Out[49]:
review_scores_rating host_id host_name host_is_superhost
0 98.0 23666 Maryanne True
1 91.0 227796 Sujatha False
2 100.0 244567 Sherry False
3 94.0 266763 Francesca False
4 98.0 266763 Francesca False
... ... ... ... ...
2518 80.0 383118468 Fun House False
2519 60.0 383888082 Kelly False
2520 97.0 386195881 Angelina False
2521 80.0 389588368 Hotel NuVe False
2522 100.0 390067872 Kieron False

2523 rows × 4 columns

In [50]:
fig = px.histogram(df_query, x='review_scores_rating', color='host_is_superhost',
                  title='Review Scores Rating',
                  labels={
                    'review_scores_rating' : 'Scores Rating',
                    'count' : 'Count'
                  })
fig.show()

Both superhost and normal host has the same peak for the distribution, indicating that on average there is no significant difference on review scores rating or the overall experience. However, superhosts has thinner distribution, indicating that superhost tend to have higher review score due to low variability.

Now we will look at the response rate and the acceptance rate between normal host and the superhost. The following is the detail description about response rate and acceptance rate according to Airbnb:

  • Your response rate measures how consistently you respond within 24 hours to guest inquiries and booking requests.
  • Your acceptance rate measures how often you accept or decline reservations. Guest inquiries are not included in the calculation of your acceptance rate.
In [51]:
query = """
SELECT host_id, host_name, host_response_rate, host_acceptance_rate, host_is_superhost
FROM host_info
WHERE host_response_rate IS NOT NULL and host_acceptance_rate IS NOT NULL
"""

df_query = pd.read_sql(query, conn)
df_query.head()
Out[51]:
host_id host_name host_response_rate host_acceptance_rate host_is_superhost
0 23666 Maryanne 100% 100% 1
1 519472 Joey 0% 0% 0
2 800558 Kim (金) 40% 0% 0
3 951965 Arun 100% 0% 0
4 1030128 Lena 100% 100% 1
In [52]:
# Transform superhost as Boolean/Logical
df_query.host_is_superhost = df_query.host_is_superhost.astype('bool')

for i in ['host_response_rate', 'host_acceptance_rate']:
    df_query[i] = list(map(lambda x: float(re.sub("[%]", "", x)), df_query[i] ))
    
df_query.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497 entries, 0 to 496
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   host_id               497 non-null    int64  
 1   host_name             497 non-null    object 
 2   host_response_rate    497 non-null    float64
 3   host_acceptance_rate  497 non-null    float64
 4   host_is_superhost     497 non-null    bool   
dtypes: bool(1), float64(2), int64(1), object(1)
memory usage: 16.1+ KB

Let's see the distribution of the response rate and acceptance rate.

In [53]:
fig = px.histogram(df_query, x='host_response_rate', color='host_is_superhost',
                  title='Host Response Rate',
                  labels={
                    'host_response_rate' : 'Host Response Rate',
                    'count' : 'Count'
                  })
fig.show()
In [54]:
fig = px.histogram(df_query, x='host_acceptance_rate', color='host_is_superhost',
                  title='Host Acceptance Rate',
                  labels={
                    'host_acceptance_rate' : 'Host Acceptance Rate',
                    'count' : 'Count'
                  })
fig.show()

Based on plots above, the distribution of response rate and acceptance rate is the same, indicating that there is no signficant difference between a normal host and a superhost.

Host Verification

When you host on Airbnb, you may be asked to provide information such as your legal name, date of birth, or government ID so it can then be verified. This is a mandatory step for a host to prevent fraud and other financial crimes. Let's deep dive into what kind of informations are mostly filled and verified from a host.

In [55]:
query = """
SELECT host_id, host_name, host_verifications, host_identity_verified
FROM host_info
WHERE host_identity_verified IS NOT NULL
"""

df_query = pd.read_sql(query, conn)
df_query.host_identity_verified = df_query.host_identity_verified.astype('bool')
df_query
Out[55]:
host_id host_name host_verifications host_identity_verified
0 23666 Maryanne email, phone, reviews, offline_government_id, ... True
1 227796 Sujatha email, phone, facebook, reviews, offline_gover... True
2 244567 Sherry email, phone, reviews, kba, work_email True
3 266763 Francesca email, phone, facebook, reviews, jumio, offlin... True
4 343908 Matthew email, phone, facebook, reviews, work_email False
... ... ... ... ...
1169 389031312 Harbour Ville email, phone False
1170 389588368 Hotel NuVe email, phone, identity_manual True
1171 390067872 Kieron email, phone True
1172 393078492 Cc phone True
1173 393244617 Wesley email, phone, work_email True

1174 rows × 4 columns

Let's check if there is any host that is not verified yet.

In [56]:
df_query[ df_query.host_identity_verified == False]
Out[56]:
host_id host_name host_verifications host_identity_verified
4 343908 Matthew email, phone, facebook, reviews, work_email False
6 519472 Joey email, phone False
7 581033 Nicholas email, phone, facebook, reviews False
11 813925 Happy email, phone False
12 951965 Arun email, phone, facebook, reviews False
... ... ... ... ...
1159 377477740 Mohd phone False
1162 381931771 Grace email, phone False
1164 383888082 Kelly email, phone False
1165 384032484 Glariant email, phone False
1169 389031312 Harbour Ville email, phone False

438 rows × 4 columns

There are a lot of host that is yet to be verified. Let's also check if there is any host that has missing value or no host_verification.

In [57]:
df_query[ df_query.host_verifications.isna() ]
Out[57]:
host_id host_name host_verifications host_identity_verified

Let's check the most commonly verified identity.

In [58]:
# Remove missing verification
df_query = df_query[ df_query.host_verifications.isna() == False]
df_query.reset_index(drop = True, inplace = True)

list_verify = []
for i in range(df_query.shape[0]):
    
    verify_i = df_query.host_verifications[i].split(", ")
    list_verify.extend(verify_i)
    
df_verify = pd.DataFrame(list_verify, columns = {"verified_info"})
In [59]:
df_verify
Out[59]:
verified_info
0 email
1 phone
2 reviews
3 offline_government_id
4 government_id
... ...
5663 phone
5664 phone
5665 email
5666 phone
5667 work_email

5668 rows × 1 columns

In [67]:
# Create frequency dataframe
df_viz = pd.DataFrame(df_verify['verified_info'].value_counts())
df_viz.reset_index(inplace = True)
df_viz.rename(columns = {'verified_info':"freq"}, inplace = True)

#Create ratio column
df_viz['ratio'] = df_viz.freq/df_query.shape[0]*100

#Get and sort the first 15 rows
df_viz = df_viz.head(15)
df_viz.sort_values('ratio', inplace = True)
In [68]:
df_viz
Out[68]:
index freq ratio
14 zhima_selfie 6 0.511073
13 weibo 7 0.596252
12 manual_online 9 0.766610
11 manual_offline 27 2.299830
10 google 43 3.662692
9 work_email 156 13.287905
8 facebook 161 13.713799
7 identity_manual 326 27.768313
6 selfie 350 29.812606
5 offline_government_id 495 42.163543
4 jumio 579 49.318569
3 reviews 607 51.703578
2 government_id 693 59.028961
1 email 1038 88.415673
0 phone 1165 99.233390
In [71]:
fig = px.bar(df_viz, x = 'ratio', y='index',
            orientation='h',
            title='Verified Information',
            labels={
                'ratio' : 'Percentage of Host',
                'index' : 'Verified Info'
            })
fig.show()

Phone number and email are the most common way to verify a host and has been done by more than 80% of all hosts, followed by the governemnt id of the host.

Conclusion

We have done some data analysis to better understand the information regarding the room listing and host from Airbnb in Bangkok. We have saw what is the most common room type available, is there any correlation between room price and the review score, who are the top 10 host, etc. You can continue further by building a beautiful analytics dashboard in Tableau or using any libraries from python, such as plotly dash, flask, or streamlit.

Don't forget to close your database connection if you are finished.

In [ ]:
# Close Connection
mydb.close()